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Abstract 

In several modern applications, ranging from genetics to genomics and neuroimaging, there 
is a need to compare observations across different populations, such as groups of healthy and 
diseased individuals. The interest is in detecting a group effect. When the observations are 
vectorial, real-valued and follow a multivariate Normal distribution, multivariate analysis of 
variance (MANOVA) tests are routinely applied. However, such traditional procedures are not 
suitable when dealing with more complex data structures such as functional (e.g. curves) or 
graph-structured (e.g. trees and networks) objects, where the required distributional assump- 
tions may be violated. In this paper we discuss a distance-based MANOVA-like approach, 
the DBF test, for detecting differences between groups for a wider range of data types. The 
test statistic, analogously to other distance-based statistics, only relies on a suitably chosen 
distance measure that captures the pairwise dissimilarity among all available samples. An 
approximate null probability distribution of the DBF statistic is proposed thus allowing in- 
ferences to be drawn without the need for costly permutation procedures. Through extensive 
simulations we provide evidence that the proposed methodology works well for a range of 
data types and distances, and generalizes the traditional MANOVA tests. We also report 
on the analysis of a multi-locus genome-wide association study of Alzheimer's disease, which 
has been carried out using several genetic distance measures. The DBF test has detected 
causative genetic variants previously known in the literature. 



1 Introduction 

Multivariate analysis of variance (MANOVA) techniques arc routinely applied to compare real- 
valued multivariate observations collected in two or more populations and test whether the mean 
vectors in those groups have been drawn from the same underlying sampling distribution (see 
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Anderson (1984) and Krzanowski (2000), for example). These methods are apphcable in a wide 
range of scientific areas. For instance, in genomics, microarrays allow the researcher to investigate 
the behaviour of thousands of genes simultaneously under various conditions. A common task 
consists of observing the expression levels of a group of genes sharing the same biological function, 
and assessing whether the mean expression levels in these gene sets differ across experimental 
conditions. Compared to a univariate approach involving only one gene expression measurement 
at a time, a MANOVA test may capture potential gene interactions and hence provide a more 
powerful and biologically meaningful way of detecting subsets of differentially expressed genes 
(Szabo et ai, 2003; Tsai and Chen, 2009; Xu and Cui, 2008; Shen et ai, 2011). 

In several applications, however, MANOVA tests may be inappropriate for at least two main 
reasons. Firstly, when the observations are multivariate, the multivariate normality assumption 
may not necessarily hold, e.g., if the observations are heavily skewed or are discrete- valued. Sec- 
ondly, the observations may not necessarily be represented by vectorial data structures. In an 
increasingly large number of applications, the data being compared across groups are functional 
(e.g. curves) and graph-structured (e.g. trees and networks) objects. This is indeed the case with 
several applications we have a particular interest in, of which the following are three representative 
examples: 

• In genome-wide association (GWA) studies, the allele frequencies observed at multiple ge- 
netic markers (single nucleotide polymorphisms, or SNPs) in healthy and diseased subjects 
are compared in order to identify genetic variants that arc associated with disease risk. Al- 
though GWAs have emerged as popular methodologies for gene mapping, traditional analyses 
that rely on testing one individual marker at a time are believed to have low power to detect 
small genetic effects. They have also been shown to have limited reproducibility, and are 
unable to detect effects which are due to gene-gene interactions. For these reasons, sets of 
multiple markers (i.e. markers observed at multiple loci) can be used instead, giving rise to 
discrete- valued vectors to be compared across groups (Wu et ai, 2010). 

• In longitudinal microarray experiments, repeated measurements of a gene's expression level 
are recorded at different time-points, and the resulting time course is often modelled as a 
functional object or curve, i.e. an infinite-dimensional data structure (Storey et at, 2005) 
(see Figure 1 for an example of such a functional dataset). The resulting gene curves observed 
under two or more experimental conditions are then compared using a test statistic that 
accounts for the functional nature of the data (Berk et ai, 2012; Berk and Montana, 2009). 

• In neuroimaging studies, there is increasing interest in comparing individual brain connec- 
tivity networks inferred in subjects that belong to two or more clinically relevant groups, 
e.g. diseased and healthy individuals. A network is a collection of nodes (vertices) and links 
(edges) between pairs of nodes. Such brain connectivity networks describe the set of con- 
nections in the neural system, or connectome, in which nodes could be neurons or cortical 
areas, and edges could be axons or fibre tracts. Thus, edges could refer to the structural 
connectivity of a neural network, or could signify correlations between the activity patterns 
of nodes forming functional connectivity (Rubinov and Sporns, 2010). Although univari- 
ate test statistics can be used to compare particular summary features extracted from the 
graphs, such as centrality measures, there is a need to compare individual graphs directly. 
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Since standard MANOVA techniques assume real- valued vectorial data representations to com- 
pare mean vectors, they are not always suitable. Moreover, in many of these applications, a very 
large number of tests are needed to be simultaneously performed within the same experiment. 
For instance, in genomic experiments such as those involving gene sets or time courses, as well 
as GWA studies, the number of comparisons can range from many tens of thousands to millions. 
Computationally efficient inferential methods arc therefore required. 

In this article wc consider the problem of comparing two or more independent random samples, 
representative of some underlying populations, by using a distance-based analysis of variance 
approach. As in traditional MANOVA, the total variability observed in the random samples is 
decomposed into the sum of between-group and within-group variability. The decomposition only 
relies on a suitable distance measure defining how dissimilar any two samples are, and does not 
require the computation of means in each group; both attractive properties when dealing with 
non-vcctorially structured objects. For a given application, a suitable distance measure is defined 
depending on the nature of the data (whether vectorial or not, for example), and on the specific 
objectives of the study, since each distance measure captures a different feature of the data. We 
discuss particular distances for genetic and functional data later in this article. Pairwise distances 
between all available samples are computed and stored in a square, symmetric distance matrix, 
with samples deemed more similar if the distances between them are small. 

A number of distance-based tests for no group effect have been proposed in the literature 
as generalizations of classical MANOVA procedures, and are applicable when the MANOVA as- 
sumptions are violated, such as with data of the type described above. These include the Mantel 
test (Mantel, 1967), the MRPP test (Mielke and Berry, 2007), and the more recent DBF test 
(Minas et ai, 2011). Each statistic utilizes some notion of within- and between-group variability 
in terms of distances: Mantel considers between-group variability, MRPP considers within-group 
variability, and DBF considers the ratio of between- to within-group variability. An important, 
and in many cases limiting, feature of such distance-based tests is that exact distribution theory 
of the statistics under the null hypothesis of no group effect is unavailable. This is due to the 
wide variety of distance measures available for different data types leading to different distribu- 
tional characteristics of the distances (Mantel, 1967). Due to this limitation, non-parametric and 
computationally intensive statistical procedures based on permutations are typically applied, since 
they do not rely on any distributional assumptions. The only requirement is that samples are ex- 
changeable across groups under the null hypothesis of no group eflFect (Pesarin and Salmaso, 2010). 
Statistical significance of an observed statistic is assessed by comparing the estimate against the 
discrete sampling distribution obtained by permuting the samples across groups and recomput- 
ing the statistic each time. For each permutation, this is equivalent to permuting the rows and 
columns of the corresponding distance matrix simultaneously, and recomputing the statistic with 
the permuted distance matrix (Mielke and Berry, 2007; Minas et ai, 2011). Thus the distance 
matrix needs only be computed once, and inferences can be drawn based on this matrix. 

Despite being routinely adopted, such a permutation approach sufl[ers from some important 
limitations. Even for low sample sizes, exact p- values require a very large number of permutations 
to be obtained; for instance, for 12 observations, the total number of permutations required is 
0(10*). Exact p- values are often impossible to obtain due to computational and time constraints, 
and Monte Carlo procedures are used instead whereby a smaller number of randomly chosen 
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permutations arc used to approximate the p- values. This approach inevitably introduces sampling 
errors (Berry and Mielke, 1983). Moreover, whereas large p- values can be well approximated by 
a Monte Carlo approach, smaller ones will be estimated less accurately (Mielke and Berry, 2007; 
Knijnenburg et al., 2009). For example, in order to obtain a permutation p- value within 10~^ of the 
true p- value, it has been shown that O(IO^) permutations are required. In several real applications, 
we have observed that less than O(IO^) permutations are typically used, especially when a very 
large number of statistical tests must be performed simultaneously. In these cases, the total number 
of permutations per test may be vastly limited even when ad-hoc parallel implementations that 
run on high-performance computing facilities are being used. 

In order to deal with these limitations, in this work we propose a closed-form, continuous 
approximation to the null sampling distribution of the DBF statistic (Minas et ai, 2011), which 
was originally proposed for the analysis of functional data arising from longitudinal microarray 
experiments. Our main contribution here is to enable distance-based analysis of variance testing 
in a wide range of applications, in a computationally efficient manner without the need for per- 
mutations. Using extensive simulations, we demonstrate that the proposed methodology works 
well for several different data structures, including functional and discrete-valued vectorial data, 
in addition to several distances, and that it provides a good approximation of the permutation 
p-values. We also show that traditional ANOVA and MANOVA tests are special cases of the 
proposed approach when their assumptions are satisfied. For these special cases we show that the 
proposed continuous distribution approximates the true distributions well. The computational 
cost of analyzing large datasets is thus reduced significantly without compromising the accuracy 
of the results. 

A second contribution of this work consists of applying of the proposed methodology to a 
multi-locus GWA study of the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort (see 
for instance, Vounou et al. (2010, 2012)). To the best of our knowledge, this is the first multi-locus 
case-control genetic study on this cohort. By using a range of genetic distances we show that the 
DBF test identifies previously known genetic variants associated with Alzheimer's disease, and 
discuss a comparison with a distance-based method that has been specifically designed for the 
analysis of such large-scale case-control studies. 

This article is organized as follows. In Section 2, we briefiy review the classical variance 
decomposition and MANOVA tests, followed by a detailed account of the proposed distance-based 
variance decomposition. The corresponding DBF statistic is defined, and theoretical connections 
with classical MANOVA and ANOVA tests are included. Section 3 describes how the sampling 
distribution of the DBF statistic exhibits skewed characteristics, and we justify the use of the 
Pearson type III distribution for its approximation. Approximate distribution theory for the DBF 
statistic then follows in Section 4. We provide evidence that the approximation works well for 
a range of distances and data types in Section 5, including empirical comparisons with ANOVA 
and MANOVA. Finally, in Section 6 we showcase the generality and applicability of the proposed 
approach by analyzing the ADNI dataset and reporting on the genetic variants we have identified. 
Conclusive remarks are given in Section 7. 
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2 A distance-based test of equality between populations 



2.1 Traditional variance decomposition and MANOVA tests 

We first consider the case of N independent vectorial and real- valued observations {yi}^i in M.^ 

tg}^=i- Each group is of size Ng 



that belong to one of G populations with means {tJ.g}9=i- Each group is of size Ng such that 



= ^ff- We arc interested in testing the null hypothesis that all the population means are 

equal, versus the alternative hypothesis that some of the means arc not equal. 

Several standard statistical tests are based on the multivariate analysis of variance (MANOVA) . 
The overall sample mean is given by y = X^iLi Hi f-^id each within-group sample mean is given 
hy ijg = ^ Vilgi for g = 1, . . . , G where Igt is an indicator variable taking the value 1 if 
observation is in group g and otherwise. The P x P total sum of squares matrix is given by 

N 



T^Y.^Vr~v) {y^ - y) 



and can be partitioned into the sum of between- and within-group sum of squares matrices, 

G ON 

B ^^Ng{yg- y) {yg - yf and W = ^Y1 {yi - yg) {yi - ygf Ig,, 
respectively. 

Existing MANOVA test statistics make use of different elements of this total sum of squares 
decomposition. For G = 2, the well-known Hotelling's statistic is given by NiN2{N — 2){yi — 
y2fW-\yi-y2)lN. For G > 2, MANOVA statistics include Wilks' Lamda, det(W)/det(T), the 
Pillai trace, tr (T~^B), and the Lawley-Hotelling trace, tr (W~^B) (Rencher, 2002; Krzanowski, 
2000). Under the assumption that the observations are independent and identically distributed 
from a multivariate Normal distribution with mean fi and variance-covariance matrix S, some 
distributional results are available. For instance, for G = 2, Hotelling's statistic (multiplied by 
a constant depending on and P) has an exact F distribution under the null; 

N - P-l 2 
(A^-2)P^ ~^^P.A,_P_i, 

where -Fp.at-p-i denotes the F distribution with degrees of freedom P and N — P —1. For G > 2, 
Wilks' Lamda, the Pillai trace and the Lawley-Hotelling trace can all be similarly transformed to 
statistics which are well-approximated by the F distribution with degrees of freedom dependent 
on N, G and P (see, for example, Rencher (2002)). 

When < P, as is typically the case with genomic datasets, the classical MANOVA tests 
cannot be applied directly. This is because the T and W matrices are singular, and so have 
at least one zero-valued eigenvalue and cannot be inverted. Several high-dimensional MANOVA 
settings have been considered in the literature, and tests of equality between groups have been 
proposed, some using traditional MANOVA statistics with generalized inverses (see Srivastava 
(2007) and Schott (2007) for good reviews, and Tsai and Chen (2009) for an application to gene 
expression data). One of the first tests was proposed by Dempster for G = 2 (Dempster, 1960), 
where an F-typc statistic defined as the ratio of within- to between-group variability was proposed. 
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This statistic was generalized for G > 2 and named the Dempster trace criterion four decades later 
by Fujikoshi et al. (2004). They noticed that the trace operator could be applied to the B and W 
sum of squares matrices to yield equivalent expressions to those proposed by Dempster. Although 
not stated explicitly, they summarize the total variability exhibited by the P variables by tr(r). 
They then partition this into between- and within-group components via tr(T) = tr(B) -|-tr(VF), 
which follows by applying the trace operator to the decomposition T ^ B + W. The Dempster 
trace criterion was then defined as tr(B)/tr(VF). and a transformation of this statistic was shown 
to be asymptotically Gaussian. In the next section we describe a further elaboration of this 
approach which only relies on pairwise distances among samples. 

2.2 The distance-based variance decomposition 

As before, we consider N independent observations {yi}fLi belonging to G groups each of size 
Ng for g ~ 1, . . . , G. Now, however, we place no restriction on the nature of these observations; 
they can be of any form, for example, scalar-valued, vector-valued, functional, graph-structured, 
or images, amongst others. The fundamental assumption is that we are able to define a distance 
measure d, which may be either semimetric or metric, which quantifies the dissimilarity between 
any pair of observations in the random sample. All pairwise distances are then arranged in an 
N X N distance matrix A = {d{yi,yj)}f'j^^. The choice of distance depends on the type of data 
and scientific problem at hand. We are then interested in testing the null hypothesis that there is 
no difference between the observations from different groups with respect to the chosen distance 
measure d, against the alternative that a difference exists between any two groups. That is, under 
the null, all observations are assumed to belong to the same population. 

First, we introduce a generalization of the variance decomposition approach using pairwise 
distances. Consider the quantity tr(T) associated with a set of vectorial observations. It is a 
measure of spread found by summing the squared Euclidean distance of each observation to the 
population mean vector. This quantity can be equivalently written using only pairwise Euclidean 
distances between observations, as follows: 

N 

tr(r) = Y.^y^~yf{y^-y), 

i=l 

-.AT ^ N ^ N N 

i—l i — 1 i—1 i—1 

^ N N 

1=1 j=i 

^ N N 

i=\ j=i 

where denotes the Euclidean distance. Thus, tr(T) is proportional to the sum of squared inter- 
point Euclidean distances between all TV observations. This well-known connection shows that the 
total variability of a given set of vectorial observations, traditionally found using the population 
mean, can be computed using only the inter-point Euclidean distances (Gower and Krzanowski, 
1999; Anderson, 2001; Minas et ai, 2011). In an analogous manner, the within- and between- 
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group variability quantities tr(W^) and tr(B) can also be written in terms of squared Euclidean 
distances, as follows: 



G N 

tT{w) = E E (y» " ^s)^ ^ hi 

3=1 t=l 



G I ^ N ^ N ^ N N \ 

g=l \ i=l j=l f i=l 1=1 j 

G I ^ N N \ 

= E U]^ E E " y^)^ " 

9=1 \ ^ »=1 J = l / 

G W TV 

= oEEE'^l;(y»'y^)- 



G W TV 

2 AT 

9=1 «=1 J = l 

and since tr(B) = tr(T) — trCH^), we obtain 

AT TV ^ G N N 



4=1 j = l g=l 'i=l j = l "'f 

4=1 j = l \ 9=1 " / 

Generalizations of these quantities can be defined by replacing the Euclidean distance, d^, 
with any distance d. Thus we can define the total variability of a set of observations \yi\f^x with 
respect to distance d as 

= ^EE^'(y^'2^j)' 

4=1 i = l 

the within-group variability with respect to d as 

^ G N N 
^A = ^EEE^'(^''2/.) ^ 

9=1 j=i i=i ^ 
and the between-group variability with respect to d as 



4=1 i=l \ g=l 9 / 



The total variability in the data captured by can hence be decomposed into the sum of 
two components quantifying within- and between-group variability. That is, Ta = VFa + ^a, 
analogously to the decomposition tr(r) = tr(W) -I- tr(S). 

We can write Ta, -Ba and VFa more compactly in matrix form by introducing the N x N 
Gower's centered inner product matrix (Gower, 1966), defined as 



Ga = ( /TV - ^Jtv) (-^A^) U - ^Jtv 
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where In is the identity matrix of size iV, Jjv is the square matrix of ones of size N and 
{In — jjJn) is the eentering matrix. This matrix contains ah the information on the inter-point 
distances between the N observations, and is sueh that its trace equals Ta; 



tr(GA) = ti[[ In- — J 



IN 



1 



27V 



tr ( Jjv A^) 



N N 



i=l 3 = 1 



Therefore we rewrite Ta more conveniently as tr(GA)- For Wa and Ba we define the centered 
N X N matrix of constants encoding group membership of each observation to one of the G groups 

as 

1 



1 T 



N 



Jn, 



\ W^-^NgJ 

where Ja is the square matrix of ones of size a. Since this matrix is centered, we have that 



In — '^•^Nj He (in - Ji^j-^N 



and we use this fact in the evaluation of the quantity tr{IIcG to derive expressions for and 
Sa in terms of G a • We have 



tr(i?,GA) 



tr 



V 



\ 



■WE J No) 



\ 



N 



IN 



\ 



( ( WT'^'^i 



2N 



:ti (JnA^) 



- :rtr 



1 T 
■N^'JN-2 



I 

A2 



N N 



\ 

G 



J 



1 — ■ ^ ^ ^ N N 

^ E E ^'(y- 2^^) - 2 ^ aT 5Z E dHy^,y,)Ia^h 

g=i a 



2N 



i=l 3 = 1 

Ta-M^a, 



and since B/^ = Ta — Wa, we have that Sa = tr(i?cGA)- Also, since Wa = Ta — -Ba, we find 
that = tr {{In - Hc)G^). 
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2.3 The DBF test statistic 

Making use of the distance-based variance decomposition above, a distance-based test statistic 
has been defined to test the nuU hypothesis of equality between groups with respect to the chosen 
distance measure. This statistic, denoted DBF, is of the form 

p = = ti'(-H'eG'A) X 
^ VKa tr ((/at -iJjGA)' 

and was originaUy used to compare functional data arising in genomic experiments (Minas et ai, 
2011). Analogously to the Dempster trace criterion and Lawley-Hotelling trace statistic, this 
statistic considers a ratio of between- to within-group variability. Larger values of this statis- 
tic provide evidence against the null hypothesis, as larger between-group variability and smaller 
within-group variability suggest that observations in the same group are more similar than obser- 
vations in different groups. A statistic of similar form was also proposed by Anderson (2001) for 
application in ecology, but with degrees of freedom divisors G — 1 and — G in the numerator 
and denominator, respectively. 

When the observations are P-dimensional vectors, it has been shown that is related mono- 
tonically to the Lawley-Hotelling and Pillai trace MANOVA statistics for G > 2 (Minas et ai, 
2011). For instance, on defining the distance matrices Aw = {dw{yi,yj)}fj^i with d'^{yi,yj) 

It, 3 = 

it was shown that 



{yi - yj)^ W 1 {yi - yj) and At = {dxiyi, yj)}fj=i with d|(y», %) = {yi - y^f T ^ [y, -y^), 



Lawley-Hotelling = PF/^^^ and Pillai trace 



1 + (1 - P)FAr ■ 



For G = 2, it was also shown that is monotonically related to Hotelling's statistic via the 
equation 

2 _ {N ~ 2)PFa^ , 
~1 + (1-P)Fa/ 

Expanding on this relationship between and ; since we know that {N — P—1)T^/ {{N — 2)P) 
has an exact F distribution under the null, it follows that 

{N-P- 1)Fa. 
1 + (1-P)Fa. 

That is, a transformation of Fat follows the exact F distribution with degrees of freedom P and 
N — P — 1. As a special case, when the data consists of scalar observations and the Euclidean 
distance ds, is applied, Fa^ is identical to the classical one-way AN OVA F statistic, ignoring the 
degrees of freedom divisors G — 1 and — G in the numerator and denominator, respectively. 
Thus, 

(Anderson, 2001; Minas et ai, 2011). 

Given an observed value of the test statistic. Fa, computed for any suitably chosen distance 
measure d, inference can be carried out using a non-parametric approach. That is, the p- value 
can be found using permutations. Given N-^ permutations tt G H, where tt is a one-to-one map- 
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ping of the set {1, . . . ,N} to itself, the set {-pA^jTren is generated by recalculating Ba for each 
permutation, denoted -Ba„ , and using the monotonic relationship 



(3) 



Ta - ^A, 



Ta is fixed for all permutations so that permutated values of i^A SuTC monotonically related to 
permuted values of _Ba- The p- value is then computed as the proportion of the N^^ permuted 
statistics greater than or equal to the observed -Fa, i-e.. 



Clearly, this is a one-sided test, since only larger values of Fa provide evidence against the null. 

As an alternative to this expensive permutation-based testing approach, we propose an approx- 
imate distribution for the null sampling distribution of Fa in Section 3. From this approximate 
distribution p-values can be well-approximated without the need for permutations. Since Fa is 
related to _Ba via (3), we first approximate the null distribution of _Ba- 

3 Distance-based between-group variability 

3.1 Skewed characteristics 

For general data structures and distance measures, the null sampling distribution of DBF test 
statistic (1) is unknown. The reason for this is that the between-group variability quantity, Ba, 
featuring in the statistic will, in general, follow some unknown distribution which depends on the 
specific distance measure being used (Mantel, 1967). On denoting the (i, j)"^ element of He by hij 
and recalling that He is centered, Ba can be expressed as the weighted sum of squared distances 



tribution, Sa would be a weighted sum of correlated and uncorrelated random variables, whose 
distribution would be difficult to evaluate. For instance, the problem of evaluating the sum of 
correlated and uncorrelated Chi-squared and Gamma random variables has been considered ex- 
tensively (see, for example, Solomon and Stephens (1977), Kourouklis and Moschopoulos (1985)). 
Although it has been argued that Ba has the appearance of a U-statistic which is asymptotically 
normal (Mantel, 1967; Hoeffding, 1948), in our experience with different vectorial and non- vectorial 
data types arising in real applications, even for large samples sizes, i?A often appears to be skewed 
to various degrees. 

Further insights into this problem can be obtained by exploring the empirical permutation 
distribution of for a number of real datasets involving different data structures and distances. 
As an illustration, we consider three different examples: 

(i) Vectorial and real- valued data: the data consists of the P ~ 50 gene expression measurements 



p- value 



#f A. > 




Thus even if each (P 



{yi,yj), for i ^ j, 



was assumed to be a random variable with known dis- 
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observed on = 103 biological samples from the Novartis multi-tissue datasct described in 
Monti et al. (2003). In this case, G = 4, corresponding to four different tissues. For this 
dataset, we considered the Euclidean, Mahalanobis and Manhattan distances. 

(ii) Vectorial and discrete-valued data: the data consists of P = 5 randomly selected SNPs 
observed on = 254 samples from the ADNI dataset (see Section 6 for further details). 
The observation of each sample at each SNP is the number of minor alleles, taking one value 
in {0, 1,2}. In this case, G = 2, corresponding to the two groups being compared, healthy 
controls and Alzheimer's disease patients. Here we used the identity-by-state (IBS), Rogers 
and Tanimoto I, and Sokal and Sneath genetic distances. The IBS distance compares the 
number of minor alleles at each SNP in the set of SNPs, while the Rogers and Tanimoto 
I and Sokal and Sneath distances use a function of the total number of matches of minor 
alleles across the whole set of SNPs. See Appendix for further details. 

(iii) Functional data (curves): the data consists of = 18 gene expression functional data repli- 
cates for a randomly selected gene in a dataset on M. tuberculosis analyzed by Tailleux et al. 
(2008). In this case, G = 2, corresponding to two different types of cell, and replicate time 
courses were observed at 4 time-points for each type of cell. These were smoothed via cu- 
bic smoothing splines to yield the 18 replicate curves (Minas et al., 2011). Figure 1 shows 
observed time courses and their fitted curves for two randomly selected genes. The L2, 
Visual L2 and Curvature distance measures were applied to this dataset. The L2 measure 
captures the difference in magnitude between curves, the Visual L2 measure captures their 
scale-invariant differences in shape, and the Curvature measure captures their difference in 
rate of change regardless of direction. Sec Appendix for further details. 



RPL6 SLC22A18 




4 18 48 4 18 48 

Time Time 

Figure 1: Replicate gene expression time courses modeled as time-dependent curves for genes RPL6 and 
SLC22A18 of the M. tuberculosis dataset; black for dendritic cells and grey for macrophages. The points 
represent the original gene expression time course measurements. 
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(a) Vectorial, real-valued: 
Euclidean 



(b) Vectorial, real-valued: 
Mahalanobis 



(c) Vectorial, real-valued: 
Manhattan 




(g) Functional: L2 (h) Functional: Visual L2 (I) Functional: Curvature 




I 1 1 1 1 I 1 1 1 1 I 1 1 1 1 — 

20 40 60 80 0.0 0.1 0.2 0.3 0.4 Oe+00 2e-06 4e-06 

Ba Ba Ba 



Figure 2: Sampling distributions of Ba obtained by using 10 Monte Carlo permutations for three different 
data types and corresponding distances, (a)-(c) Vectorial and real-valued gene expression data with 
— 103. (d)-(f) Vectorial and discrete-valued SNP data with A'^ = 254. (g)-(i) Functional representation 
of longitudinal gene expression data with N = 18. Overlayed is the proposed approximate probability 
density function described in Section 3.2. 

The exact permutation distribution of in each case would be given by the set {B^^ }7ren 
where 11 contains all A^! permutations tt of the elements of {1, . . . ,N}. For large N, due to the 
computational effort required in enumerating all possible permutations, the exact distribution is 
generally unavailable. Figure 2 shows the approximate sampling distribution of B^ obtained by 
using 10^ Monte Carlo permutations for each of the data types and distance measures considered. 
In almost all cases the distributions are heavily skewed. Remarkably, this is also the case for large 
sample sizes. 



3.2 A Person Type III approximation 

In order to approximate the exact permutation distribution Ba, we propose an approach based 
on moment matching. Given the skewness often observed in real datasets, we assume that Ba 
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follows a Pearson type III distribution. This is a type of Gamma distribution encompassing other 
distributions as special cases, such as the Exponential and Normal distributions (Josse et al., 
2008) , and has been successfully used for modeling skewed sampling distributions of various other 
statistics (Berry and Mielke, 1983; Kazi-Aoual et at, 1995; Josse et ai, 2008). 

The first three moments of the exact permutation distribution of i?A are given by 

/^^^AH^ ^"' = ^ 2^5^. -/^s, and 7B = , 

■ Tren ■ iren ^ 

respectively. By evaluating the expressions analytically, closed form manipulations of the three 
moments have been derived allowing their efficient computation for iV > 6 without the need for 
permutations (Kazi-Aoual et al, 1995). These closed form expressions require only the square, 
symmetric and centered matrices He and Ga- The mean and variance, for example, are given by 

AJh , 2 2 {{N - 1) A2 - Aj) {{N - 1) B2 - Bf) 
HB = -TT — - and ag - 



N-l {N -IfiN + 1){N -2) 

{N{N + l)A;-{N -\){Al + 2A2)) 
{N + l)iV(A^ - l)(iV - 2)(iV - 3) 

X {N{N + 1)B3 - (iV - 1) {Bl + 2B2))] , 

where Ai = tr(iJc), A2 = tr (HcHc), A3 = tT{H^), Bi = tr(G'A), B2 = tr(GAGA) and 
B3 = tr (G^), where, for matrix A = {oylf^j^i, — {afj}^j^i- The expression for skewness is 
much more involved; the reader is referred to Kazi-Aoual et al. (1995). 

On standardizing Ba by subtracting fiB and dividing by crs, the Pearson type III distribution 
can be parameterized by only the skewness parameter, jb- That is, 

B^^ = ^^^^PT,nijB), 

(7B 

where PTm denotes the Pearson type III distribution. By assumption of this model, the support 
of random variable is given by [—2/73, 00) if 75 > 0, and (—00, —2/^b\ if 7s < 0, and we 
denote the cumulative distribution function (CDF) of by ^s|^(fo;7B), and probability density 
function (PDF) of by /s|^(6;7i3). The PDF fB=^{b;iB) is defined by 

(2/7b)^ /2 + 7£^V'~^"^^^" f 2(2 + 7b6) 
r(4/7|) I 7B J '''H 7| 

for 7b > and ~2/jb < b < 00, where r(-) denotes the usual Gamma function, 
(-2/7b)'/^^ /-(2 + 7b6)\^'"''^^/''^ /-2(2 + 7b6) 



r(4/7|) V IB J '""^V 7| 

for 7b < and —00 < b < —2/^b, and the standard Normal distribution for 7b = (Mielke and Berry, 
2007). 

In practice 7b is not equal to zero exactly, as exhibited for some real datasets in Figure 2. 
We see that all sampling distributions, and corresponding approximate PDFs. are skewed to some 
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degree. Thus in the next section we only consider the cases where 75 > and 7b < 0, and do not 
consider the trivial case of 75 = 0. 



4 Approximate null distribution of the DBF statistic 

The DBF statistic F/^ is related to the standardized distance-based between-group variability 
quantity via the one-to-one function h : M> defined by 



+ , (4) 



with inverse h ^ : Fa ^ B'^ defined by 



(5) 

We aim to derive the approximate null distribution of F^ in terms of the distribution of B'^ via 
transformation h, which is required to be continuous over the given supports of B^. However, it 
is not continuous in the positive plane at /3 = (Ta — ^J'B)/o^B because Ta = tr (Ga) > I^b due to 
tr(i?c) = 1- Since the boundaries of both supports of B^ depend on the skewness, the position of 
/3 may or may not affect the continuity of the distribution of B^ over the given support. We thus 
consider dealing with the discontinuity separately for both supports of B'^, i.e. for both cases of 
skewness. 

First consider the positive skewness case, where the support of B^ is [—2/jb,oo). Since 
7b > 0, —2/jB is negative and the discontinuity shows itself as B^ increases form —2/73 to 00. 
Figure 3 (a) shows how Fa behaves as a function of B^ over this support; Fa is an increasing 
function of B^ on both sides of the discontinuity at /3. 

Wc thus divide the support of B^ into the two regions, namely 

— ,(3) and (/3, 00) , 
.7s / 

where the equivalent supports of Fa are given by [a, 00), where 

^ _ 7_bMb - 

7B (Fa - Ms) + 2a b 



(6) 



satisfies h (a) = -~2/-fB, and (—00, —1). We can show by contradiction that a > —1, since a < — 1 
implies T^jb 0, and by definition both Fa and jb are positive. In these regions we can apply 
the transformation since there are no discontinuities, and define the CDF of Fa in terms of the 
CDF of B'^ by 

^ . /-^bj,(/^-H/);7b)--Fb^(/3;7b) -00 < / < -1 

[i + Fb^ (/i~H/);7B) -Fb|,(/3;7b) a</<oo 

for jB > 0. The relevant derivations arc in the Appendix, including a proof that this is a valid 
CDF. 
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(a) yb > 



(b) Yb < and a < -1 




Figure 3: (a) Fa and sxe monotonically related everywhere except at = /3 for 7_b > over the 
support [— 2/7s, oo). (b) Fa and are monotonicaUy related everywhere except at Ba ~ P for 73 < 
over the support {—00, —2/75] when a < —1. 



Now we turn our attention to the negative skewness case, where the support of is (—00, —2/75] 
In this case, 7b < so that —2/73 is positive. The discontinuity at = f3 thus only needs to be 
considered if (3 is to the left of —2/75, otherwise it can be ignored since it is not included in the 
support of B^. We consider these two cases separately. First consider the case where /3 < —2/73. 
We have that 

7b ctb ctb 7b 



since > fig, from which we find 



< ^^^^^ < 1. (8) 



Applying this with the equation for a given by (6) yields a < — 1. Thus we define the occurrence 
of this first case when a < — 1. Figure 3 (b) shows how Fa behaves as a function of B^ over 
the support (— cxd, when a < —1. As with the positive skewness case. Fa is an increasing 

function of B^ on both sides of the discontinuity at f3. We thus divide the support of B]^ into 
the two regions 

(—00, (3) and /3, 



7B_ 

with equivalent supports of Fa given by (— 1,cxd) and (—00,0:], in which Fa and B^ arc mono- 
tonically related with no discontinuities. Thus we define the CDF of Fa as 

1 +-Fb^ (/i-H/);7b) --^bj, (/3;7b) -K / < cx) 
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for 7b < and a < — 1. 

Now consider the case where /3 > this is equivalent to a > — 1. In this case there are 

no discontinuities in the support of _B^, so i^A and B"^ are monotonically related everywhere. 
The support for i?^ given by (— oo, —2/75] is equivalent to the support for of (— l,a]. Thus 
the CDF of f A is defined as 







1 



-00 < / < -1 



:FB'{h-\f):iB) -l<f<a 



(10) 



a < / < 00 



for 7b < and a > —1. 

Using these results, the approximate p- value of an observed _Fa can be readily obtained without 
permutations. On computing the pcrmutational mean ^b, variance a^, and skewness 75, and ad- 
ditionally a if 7b < 0, the p- value is given by (^-Pa; MT7 cb, 75^ , where J-p^^ (•; /^t, o-b,^b) 
is the CDF chosen for the specific case of skewness and a value. 

For the given case of skewness and a value, the PDF of i^A, denoted fp^ {f;fj.T,crB,jB), is 
given in terms of /b= {b]jB) by differentiating the CDF. Thus we have that 



fF^{f;fJ-T,(TB,lB) = 



df 



^b(1 + /P^"- 
where the range of / is given by the selected case of CDF. 



.fB-^ih-\.fynB) 

fB^Ah-\f);jB), 



5 Simulation Results 

5.1 Comparison with ANOVA and MANOVA 

Given that F/^ equals the ANOVA F statistic up to a constant as a special case for univariate 
data, we verify that the distribution of i^A approximates that of the ANOVA F statistic well as 
N and G increase. Also, since is related to Hotelling's as a special case for multivariate 
data with G = 2, we verify that our proposed distribution, transformed via (2), approximates the 
distribution of well as increases. That is, we aim to show that for the special cases the DBF 
test is approximately equivalent to the ANOVA F and Hotelling's tests, respectively. 

For the univariate case, data was generated under the null and the DBF statistic using the 
Euclidean distance measure, cIe, and the ANOVA F statistic were computed. P-values were found 
by comparing against their respective distributions. For N = 40, 100,500, 1000 and G = 2,4,5, 
the k^^ Monte Carlo run consisted of simulating yi, . . . , j/at ^ N{fik,crl), where fik ^ 10, 10), 

^ [/(0,10), and U{a,b) denotes the Uniform distribution over [a,b]. The mean and standard 
deviation of the absolute differences between the p-values obtained for B = 200 Monte Carlo 
simulations are reported in Table 1. It can be seen that as TV and G increase, the absolute 
difference between the p-values decreases, thus showing that the approximate distribution of the 
DBF statistic behaves as expected in this case. 
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Table 1: Mean (and standard deviation) of the absolute differences between p- values of the DBF statistic, 
and ANOVA F (P = 1) and Hotelling's (P = 10) statistics, under the null for 200 Monte Carlo runs. 









P 


= 1 






P 


= 10 


N 


G 


= 2 


G- 


= 4 


G 


5 


G 


= 2 


40 


0.0252 


(0.0342) 


0.00670 


(0.00535) 


0.00565 


(0.00464) 


0.004702 


(0.00272) 


100 


0.0137 


(0.0225) 


0.00306 


(0.00267) 


0.00254 


(0.00196) 


0.00200 


(0.00121) 


500 


0.00474 


(0.00951) 


0.000594 


(0.000525) 


0.000541 


(0.000383) 


0.000392 


(0.000260) 


1000 


0.00276 


(0.00599) 


0.000344 


(0.000272) 


0.000278 


(0.000196) 


0.000212 


(0.000132) 



For the multivariate case, the DBF statistic using the Mahalanobis-like distance measure dr, 
and the Hotelling's statistic were computed. P-values were found by comparing against their 
respective distributions. For N = 40, 100, 500, 1000 and P = 10, the fc"^ Monte Carlo run consisted 
of simulating yi,...yN ^ iVp(/Xfc,Sfc), where fik (^ifei, • . • , A^Pfc)^ with ^ijk ~ J7(-6,6) for 
j = 1, . . . , P, and Sfc a random Wishart matrix of size P x P. The mean and standard deviation 
of the absolute differences between the p-values obtained for B = 200 Monte Carlo runs are 
reported in Table 1. As increases the difference between the p-values decreases, showing that 
the DBF and Hotelling's tests are approximately equivalent as N increases. 

A further experiment was performed to show that, as A^ increases, the proposed approximate 
null distribution of the DBF statistic approximates the true ANOVA F and Hotelling's dis- 
tributions, on applying the required transformations, quite well. In particular, we show that it 
yields a better approximation than a permutation-based CDF, especially when the number of 
permutations is low. 

For P = 1, G = 2, and each oi N = 50, 70, one set of univariate observations were generated 
under the null from a Normal distribution as above. The DBF null CDF, suitably transformed, 
and the ANOVA F CDF were obtained, and the Kolmogorov-Smirnov (KS) statistic was used to 
compute the difference between these distributions. This statistic is computed as the maximum 
distance between two vectors representing the CDFs of interest; we used a vector of 1000 equally 
spaced points across the range of the approximate DBF distribution. For the given dataset for 
each A^, and for each oi B ^ 200 Monte Carlo runs, we used an increasing set of Monte Carlo 
permutations to compute the permutation CDF of the DBF statistic. We used lO'^, 10'*,5 x 10* 
and 10^ permutations, so that for each Monte Carlo run, 10"^ Monte Carlo permutations were 
enumerated, then 9 x lO'^ Monte Carlo permutations were added to yield the larger set of 10"' 
permutations and so on. For each of these 4 sets of permutations the KS statistic depicting 
the difference between the DBF permutation CDF and the ANOVA F CDF was computed. This 
yielded an empirical distribution of 200 KS statistic values for each set of permutations. The results 
of this experiment are shown in plots (a) and (b) of Figure 4. We see that for A^ — 50, using 
more than 5 x 10* permutations yields a permutation distribution which is directly comparable 
with our approximate distribution. For A^ ~ 70, however, the approximate DBF distribution 
better approximates the true underlying ANOVA F distribution than the permutation distributions 
typically used in practice; typically not more than 10^ permutations are used for real data analyses. 

For P = 10, G = 2 and A^ = 50, one set of multivariate observations were generated under the 
null from a Multivariate Normal distribution as described above. The DBF null CDF, suitably 
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transformed, and the Hotclling's CDF were obtained. Repeating as above, and using the 
KS statistic to quantify the difference between the transformed DBF permutation CDF and true 
Hotelhng's CDF, the results are given in plot (c) of Figure 4. We see that for = 50 the 
approximate DBF distribution yields a better approximation of the true distribution than the 
permutation distributions. 



(a) P=1 , N=50 (b) P=1 , N=70 (c) P=1 0, N=50 




10^ 10* 5x10" 10^ 10^ 10* 5x10* lo'' 10^ 10* 5x10* 10' 

Number of permutations Number of permutations Number of permutations 



Figure 4: (a)-(b) Empirical distributions of the KS statistic quantifying the distance between the DBF per- 
mutation CDF, suitably transformed, and the ANOVA F CDF, for each set of Monte Carlo permutations. 
The dotted line represents the KS statistic comparing the approximate DBF CDF, suitably transformed, 
and the ANOVA F CDF. (c) Empirical distributions of the KS statistic quantifying the distance between 
the DBF permutation CDF, suitably transformed, and the Hotelling's CDF, for each set of Monte 
Carlo permutations. The dotted line represents the KS statistic comparing the approximate DBF CDF, 
suitably transformed, and the Hotelling's CDF. 



5.2 Approximate null distribution for various data types and distances 

fn this section we illustrate how the approximate distribution of the DBF statistic compares with 
the Monte Carlo permutation distribution for a number of data types and distances. In our setting, 
we explore a range of sample sizes and distance measures for datasets simulated to mimic the real 
datasets introduced in Section 3.1. For vectorial and real- valued data, we consider the Euclidean, 
Bray-Curtis, Canberra, Manhattan and Maximum distances. For vectorial and discrete-valued 
genetic data, we consider the identity-by-state (IBS), Simple Matching, Sokal and Sneath, Rogers 
and Tanimoto I (RTI) and Hamman I distances. For functional data we consider the L2, Visual L2 
and Curvature distances (see Appendix for further details on the genetic and functional distance 
measures). On selection of data type, distance measure and number of samples A'^, the datasets 
are simulated as follows: 

(i) Vectorial and real- valued data: P-dimensional vectors = [yn, ■ ■ ■ ,yip)'^ were simulated 
such that yij iV(0,4) for i — 1,...,N and j = 1,...,1000. For the Bray-Curtis and 
Canberra distances where positive values are required, we take absolute values. 

(ii) Vectorial and discrete- valued data: 5-dimensional vectors = {yn , . . . , j/is)'^ for i = 1, . . . , N 
were simulated based on the observations of the 153 control subjects from chromosome 1 of 
the ADNI dataset introduced in Section 6. N control subjects were randomly selected and 
their minor allele counts at 5 randomly chosen SNPs across the chromosome selected. 
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(iii) Functional data (curves): N curves {yi{t)}^i were simulated over the range t 6 [0,48] by 
using quadratic Bezier curves (Farin, 1992) and a sampling procedure involving smoothing 
splines (Ramsay and Silverman, 2005). N Bezier curves were randomly generated, and N 
1000-dimensional vectors were sampled from these curves at equally spaced points across 
[0, 48] with standard Gaussian error. These were then smoothed via cubic smoothing splines 
to yield the curves; see Minas et al. (2011) for further details of this curve-generation pro- 
cedure. This procedure generates random curves similar to those observed in real longitudinal 
datasets, such as those shown in Figure 1. 

Table 2: Mean (and standard deviation) of the absolute differences between theoretical and permutation 
p- values of the DBF statistic under the null with 200 Monte Carlo runs for vectorial, SNP and functional 
(curve) distances. For — 10, all 10! permutations were used, and for A'^ = 30, 100, 10^ Monte Carlo 
permutations were used. 



Data 
type 


Distance 
measure 


N 

10 30 100 


Vector, 
real- 
valued 


Euclidean 
Bray-Curtis 
Canberra 
Manhattan 
Maximum 


0.0159 (0.0139) 
0.0135 (0.0110) 
0.0154 (0.0133) 
0.0141 (0.0127) 
0.0165 (0.0142) 


0.000706 (0.000554) 
0.000664 (0.000566) 
0.000762 (0.000635) 
0.000565 (0.000453) 
0.000675 (0.000550) 


0.000317 (0.000266) 
0.000297 (0.000259) 
0.000318 (0.000261) 
0.000314 (0.000254) 
0.000314 (0.000231) 


Vector, 
discrete- 
valued 


IBS 

Simple Matching 
Sokal and Sneath 
RTI 
Hamman I 


0.0223 (0.0180) 
0.0222 (0.0206) 
0.0217 (0.0187) 
0.0237 (0.0197) 
0.0211 (0.0201) 


0.00507 (0.00484) 
0.00321 (0.00301) 
0.00551 (0.00509) 
0.00212 (0.00189) 
0.00324 (0.00317) 


0.00174 (0.00108) 
0.00152 (0.000964) 
0.00393 (0.00220) 
0.000646 (0.000405) 
0.00158 (0.000988) 


Functional 


u 

Visual L2 
Curvature 


0.0267 (0.0256) 
0.0370 (0.0332) 
0.0515 (0.0502) 


0.00870 (0.00803) 
0.0130 (0.0118) 
0.0262 (0.0238) 


0.00595 (0.00573) 
0.00885 (0.00841) 
0.00880 (0.00975) 



We compared the theoretical and permutation p-values resulting from applying the DBF test 
under the null for the different distances applied to each data type. For N = 10, 30, 100 and G = 2, 
B = 200 Monte Carlo runs were performed, where for each run data was generated under the null, 
i.e., no group effect. For N = 10, all A'^! permutations were used to compute the permutation 
p-value, but for A^ — 30, 100, a Monte Carlo set of 10^ permutations was used. The theoretical 
and permutation p-values were computed, and the mean and standard deviation of the absolute 
difference between these for each combination of data type, distance measure and A^ are reported 
in Table 2. As expected, the absolute difference between the p-values decreases as A^ increases for 
each distance measure applied to each data type. 

6 A genome-wide association study of Alzheimer's disease 

In traditional case-control association studies, subjects are genotyped for a comprehensive range of 
genetic markers across the entire genome, and markers associated with disease risk are sought. The 
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objective of such studies is to identify genetic variants in the human genome that are associated 
with disease risk (see, for example, Altshuler et al. (2008) and Pearson and Manoho (2008) for 
good overviews of traditional GWA studies). 

Genetic variations are captured by observing SNPs. Human SNPs are biallelic genetic mark- 
ers, and as such, the genotype of an individual at a given SNP is represented by one of three 
combinations of two alleles; a major allele occurring more commonly in the study cohort, and a 
minor allele which is less common. The possible combinations of alleles are 'major, major', 'ma- 
jor, minor' and 'minor, minor', and the genotype of the individual at a given SNP is summarized 
by the minor allele count. The genotype of individual i at a given SNP k, denoted t/ifc, is thus 
represented by either a 0, 1 or 2 corresponding to homozygotes for the major allele, heterozygotes 
and homozygotes for the minor allele, respectively. 

A common approach of scanning the genome in search of causal variants is to group SNPs 
together into subsets where it is plausible that some dependence exists between them, for example, 
if the SNPs are in the same gene or biological pathway; generally referred to as the multi-locus 
approach (Mukhopadhyay et al., 2010; Wu et al., 2010; Yang et al., 2009). Sliding windows which 
partition the genome into overlapping subsets of SNPs of the same length have also been proposed 
to group SNPs together, and ensure coverage of the entire genome without omitting possibly useful 
information in intergenic regions. 

We describe an application of the DBF test with the approximate inference approach of Section 
4 by conducting several multi-locus GWA studies on the Alzheimer Disease Neuroimaging Initiative 
(ADNI) cohort. The available dataset consists of 254 subjects, 101 cases of AD and 153 controls, 
all genotyped at 316, 348 SNPs. We assume the genome has been partitioned into subsets of P = 5 
SNPs so that, for each subset, each of the N individuals in the given study cohort is represented by 
the discrete-valued P-dimensional vector {yi = [yn, ■ . ■ , yjp)}fLi- This results in a total number 
of 316, 260 SNP sets to be compared across the two populations. For each sliding window, five 
genetic distances (IBS, Simple Matching, Sokal and Sneath, Rogers and Tanimoto I and Hamman 
I) were applied, and the DBF statistic and corresponding p-value computed using the approach 
described in Section 4. Due to the multiple testing problem, a genome-wide significance threshold 
of 10~^ was set to identify statistically significant SNP sets. 

In Figure 5 we provide the corresponding Manhattan plot which depicts the significant SNP 
subsets across the entire genome for the Sokal and Sneath distance measure, showing the greatest 
effects around chromosomes 18 and 19. The results of all distance measures were summarized by 
the unique SNP and gene combinations identified; see Table 3. All significant SNPs are identified in 
chromosomes 18 and 19. In particular, chromosome 19 contains two genes, AP0E4 and TOMM40, 
which are the major genetic variants found in many studies (see, for example, Braskie et al. (2011) 
and Shen et al. (2010)). Other previously reported genetic variants that overlap with our findings 
include AP0C2, AP0C4, PVRL2 and CLPTMl (Takei et al, 2009; Yu et al, 2007). The DCC 
gene has also been previously identified (Bredesen, 2009; Lourenco et al, 2009). 

As an illustrative comparison with the permutation approach, in Figure 6 we compare the 
approximate distribution of the DBF statistic under the null with the null permutation distribution 
obtained using by 10^ Monte Carlo permutations. This was done for the first sliding window of 
chromosome 19 and three genetic distances. Here again it can be clearly seen that the approximate 
null distribution of the DBF statistic provides a good fit. 
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-logtpvaluM) across all chramosomes with the Sokal and Sneatti distance measure 



8 - 



1 




Figure 5: Manhattan plot of the -log(pvalues) computed across the genome for the Sokal and Sneath 
distance measure. Each point represents a window containing a muhi-locus SNP subset consisting of 5 
adjacent SNPs. The dashed line represents the genome-wide significance threshold of — log(10~^). The 
black and grey colours were used to distinguish between adjacent chromosomes. 



IBS 



Sokal and Sneath 



Rogers and Tanimoto I 





T 1 1 1 1 r 

-0.01 0.00 0.01 0.02 0.03 0.04 



1 1 1 1 1 1 r 

0.005 0.005 0.015 0.025 



Figure 6: The approximate PDF of Fa for the IBS, Sokal and Sneath and Rogers and Tanimoto I 
distance measures versus the sampling values of F/\ obtained with 10*' Monte Carlo permutations for 
the first sliding window of SNPs for chromosome 19. This window contains 254 samples of the 5 SNPs 
rsl2459906, rsll878315, rsl0409452, rs3813154 and rs895344. 



We have also compared our results with those obtained using a statistical procedure developed 
specifically for multi-locus case-control studies, the kernel-machine logistic regression approach 
of Wu et al. (2010), which we denote LKMT. This method makes use of similarities rather than 
distances via kernel functions if (•, •); note that kernels are related to distances, for example, 
the IBS kernel between individuals i and j is just one minus their IBS distance. Since LKMT 
makes use of an approximate distribution rather than permutations, we are able to compare the 
two methods without major computational issues. LKMT consists of a logistic kernel- machine 
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Table 3: Significant SNPs and genes identified for each distance measure with the genome- wide significance 
threshold of 10~^. The chromosome in which the SNPs were identified are given, in addition to the p- value 
of the sliding window containing the given SNP. 



Distance measure 


SNP 


Gene 


Chromosome 


P-valuc of window 




rs2075650 


TOMM40 


19 


1.626 X 10"^° 




rs8106922 


TOMM40 


19 


3 600 X 10~^° 




rs5167 


AP0C2, AP0C4 


19 


4.844 X 10~^° 




apoe4 


APOE 


19 


4.844 X 10-1° 


IBS 


rs3760627 


CLPTMl 


19 


1.237 X lO"'' 




rs405509 


APOE 


19 


3 080 X lO"'' 




rs2075642 


PVRL2 


19 


6.449 X 10"* 




rs6859 


PVRL2 


19 


6.449 X 10"* 




rsl57580 


TOMM40 


19 


6.449 X 10"* 




apoe4 


APOE 


19 


8.458 X 10"" 




iS4UooUy 






o.4t)o X iU 




rs2075650 


TOMM40 


19 


8 458 X 10"ii 




rs8106922 


TOMM40 


19 


8 458 X 10"" 




rsl 57580 


TOMM40 


19 


2 104 X 10"* 




rsl 222988 


DCC 


18 


5 736 X 10"* 

• 1 Kjyj ^ -L\J 


Sokal and Sncath 


rsl2960771 
rsl560531 


DCC 
DCC 


18 
18 


5 736 X 10"* 
5.736 X 10"* 




rs2960617 


DCC 


18 


5.736 X 10"* 




rs3862684 


DCC 


18 


5.736 X 10"* 




rs2075642 


PVRL2 


19 


7.498 X 10"* 




rs4803766 


PVRL2 


19 


7.498 X 10"* 




rs6859 


PVRL2 


19 


7.498 X 10"* 




rsl7748116 


DCC 


18 


8.212 X 10 ^ 




rsl57580 


TOMM40 


19 


4.129 X 10"i" 




rs2075650 


TOMM40 


19 


4.129 X 10"io 


Rogers and 


rs8106922 


TOMM40 


19 


1.067 X 10"9 


Taninioto I 


rs5167 


AP0C2, AP0C4 


19 


6.915 X 10"* 




apoe4 


APOE 


19 


6.915 X 10"* 




rs405509 


APOE 


19 


6.915 X 10"* 




apoc4 


APOE 


19 


2.738 X 10"i° 


Simple Matching 
Hamman I 


rs405509 
rsl57580 


APOE 
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regression model where the probability that a given subject is a case subject is modelled as some 
function of the similarities between that subject and all other subjects. We denote the case-control 
status of the N subjects by {si}fLi, where Si = for controls and s; — 1 for cases. The model 
is then given by logit[P(si = 1)] = /3o + m{yi) for i = 1, . . . , A^, where /3q is an intercept, and 
m{zi) = ^j^'^iyiiVj) foi' some constants {5j}f^i so that m(-) is completely specified by the 

similarities between the subjects depicted by the kernel function K{-, •). The null hypothesis of no 
group effect is stated as Hq : h{yi) = for z = 1, . . . , A^, and the variance-component score statistic 
Q = (s — 0o)'^K{s — $q)/2 is used to test this, where s = (si, . . . ,sn)'^ , $q = $o{^, ■ ■ ■ , 1)"^ and 
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-log(pvalues) of the DBF test and LKMT applied across chromosome 19 with the IBS distance measure 




2000 3000 
Sliding Window 



Figure 7: — log(pvalues) computed across chromosome 19 using the DBF and LKMT tests with the 
IBS distance measure and kernel function, respectively. Each point represents a multilocus SNP subset 
consisting of 5 adjacent SNPs, and the dashed line represents the transformed genome-wide significance 
threshold of -log(10-''). 



K = {K{yi,yj)}f^j^i- The distribution of Q under the null hypothesis is approximated by a 
Chi-squared distribution. 

The p- values obtained by LKMT are of the same order as our approach with the IBS distance 
measure when using the comparable IBS kernel. For instance, for chromosome 19, which is the 
chromosome containing the smallest p-values. Figure 7 provides a visual comparison of the two 
methods when using the IBS measure; similar results are obtained for other chromosomes (not 
shown). Although a rigorous power comparison is beyond the scope of this study, it can be noted 
that both methods identified the same causal SNPs in Chromosome 19 at the significance level 
of 10"'^, i.e., the ones listed in Table 3. Thus the well-known APOE and TOMM40 genes were 
identified by both approaches. 



7 Conclusion 

For large datasets, such as those arising in biological applications, it is desirable to apply distance- 
based tests for equality between groups without permutations. This requires that the discrete 
sampling distribution of the chosen statistic under the null is approximated by a suitably cho- 
sen continuous distribution. In this article we focused on the DBF statistic due to the intuitive 
distance-based variance decomposition it uses. We described some results relating the DBF statis- 
tic to classical MANOVA and ANOVA statistics. Its connection with ANOVA is well-known, 
allowing an exact null distribution of the DBF statistic to be obtained when applied in the spe- 
cial case of univariate Normal data with the Euclidean distance measure. Its connection with 
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MANOVA, however, is not so well-known. We presented the result that for multivariate Normal 
data with G ~ 2, the DBF statistic applied with a Mahalanobis-like distance measure is related to 
Hotelling's statistic. From this connection the exact null distribution of the transformed DBF 
statistic is attainable. 

For more general data types, and corresponding distance measures, where the exact null dis- 
tribution of the DBF statistic is unknown, we considered its permutation distribution. We showed 
that it depends on the permutation distribution of the between-group variability component of the 
distance-based variance decomposition. On presenting the skewed characteristics of the between- 
group variability for real biological datasets, we justified the use of the Pearson type III distribution 
to model its skewed nature. We then used its monotonic relationship with the DBF statistic to 
derive an approximate distribution for the DBF statistic. 

For a range of data types and distance measures we provided evidence to support the per- 
formance of the proposed approximation. We also used it to conduct several case-control GWA 
studies on the ADNI dataset in order to show its applicability to a large dataset. This study 
identified a number of genes inline with validated genetic risk factors for Alzheimer's disease. We 
also showed that the proposed approach performs comparably to an existing test statistic that has 
been specifically designed for case-control association studies. 

The R software for performing the DBF test with the approximate null distribution is available 
from http : //www2 . imperial . ac .uk/ ^gmontaina. 
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Appendix 



A Multi-locus genetic distance measures 

Assume N individuals in the given study cohort are observed at P SNPs, yielding the discrete- 
valued P-dimensional vectors {yi ~ {yn, ■ • ■ , yip)}iLi- The identity-by-state (IBS) distance mea- 
sure is commonly used, and measures the proportion of risk alleles (minor alleles) shared between 
individuals across the SNPs considered (Wessel and Schork, 2006; Wu et at, 2010; Mukhopadhyay et 
2010). The distance between individuals i and j is defined by 

1 P 

k=l 

where s{yik,yjk) = if y^k = and yjk = 2, or if yn, = 2 and yjk = 0, s{yik,yjk) = 1 if y^fe = 1 
and yjk ^ 1, or if y^k = 1 and yik ^ 1, and s{yik,yjk) = 2 if y^k = yjk- This distance takes 
values between and 1. Weighted versions of this distance also exist where a weight is attached 
to each of the P SNPs depending on properties such as functional significance or frequency of the 
minor allele (Wessel and Schork, 2006; Li et al, 2009). Genetic distances have also been proposed 
based on the contingency table between individuals i and j containing the frequency that each 
combination of yik and yjk values occur over the range of SNPs considered (Selinski and Ickstadt, 
2005); see Table 4 below. 

Table 4: Contingency table containing the frequency of a given combination of minor allele count between 
individuals i and j over the P SNPs. niki is the frequency of individual i having k minor alleles and 
individual j having / minor alleles. 



Individual i 





Individual j 
1 


2 





moo 


moi 


mo2 


1 


mio 


mil 




2 


m2o 


m2i 





The key statistics in this table are the number of complete matches of the minor alleles, 
771+ = '}Z'k=o^kk, and the number of mismatches, = P — m+, where the total number of 
comparisons made is P. Based on these quantities, the following 'matching coefhcient' distance 
measures can be defined; the Simple Matching distance 



the Sokal and Sneath distance 



dsM[yi,yj) = 1 - -p-, 



dss(yi,Vj) = 1 - 



m+ + im ' 
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the Rogers and Tanimoto I distance 



dRTiiyijV]) = 1 



m ' 



m+ + 2m 
and the Hamman I distance 

s*{yi,yj) 



dHi{yi,yj) = 1 



max^.j{s*{yi,yj)}' 



where s*{yi,yj) = SHi{yi,yj) + \ ^vtVi,j{sHi{yi,yj)}\ with SHiiyi,yj) = (m+ - m )/P. All of 
these distances take values between and 1. 



B Distance measures between curves 

Assume a set of TV curves {yi{t)}fLi defined for < 6 r where all curves have been defined over the 
same range t. The L2 distance represents the area between curves, and hence the magnitude of 
difference between them (Ferraty and Vieu, 2006; Minas et al, 201 f; Salem et al., 2010), and is 
defined by 

dL{yt,yj)= (^J {yi{t) - y,j{t)f dt 

The curvature distance quantifies the difference in the rate of change between two curves 
(Ferraty and Vieu, 2006; Minas et al.., 2011), and is defined by 



dc{yi,y]) 



{y'l{t)fdt- I {y';{t)Ydt 



The visual L2 distance quantifies the difference in the scale-invariant shape between curves, 
analogously to the difference detected by the human eye (Marron and Tsybakov, 1995; Minas et al., 
2011). For this distance, curves and yj are scaled both in time and magnitude, so that 
their values range between and 1 in time period [0,1]; denote these y|(i) and y|(i) where 
t e [0,1]. They are then represented as infinite sets of points in the two-dimensional plane, de- 
noted p, = {{t,yl{t)) I t G [0,1]} and pj = {(i,2/|(t)) | t G [0,1]}. The visual L2 dist ance is then 
defined by 



.1 /.I 

dv{y^,y,)^[ 1^ 5lmt + 5%{t)dt 



where 5ij (t) is the minimum Euclidean distance between the point y^ (t) and all points pj repre- 
senting yj; note that Sij{t) is not necessarily equal to 5ji{t). 
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C Derivation of the CDF of Fa for 75 > 

First consider the case where — cxd < / < — 1. By inspection of Figure 3 (a), we see that we need 
only consider the relationship between and B]^ where > /?. We thus have that 

J^F^{f;f^B,crB,lB) = P{F£^ < f; ^3,^3,13) 

= P{P<B'^<h-\f);^B) 

= P{B-k<h-\f);-fB)-{B-^<P;-fB) 

= J-B^ (/i-1(/);7b) -^i3j,(/3;7i3)- 

Now consider the case where a < f < oo. From Figure 3 (a) we see that we must consider 
the relationship between Fa and B^ for B'^ < /3, while adding the cumulative component of all 
values of Fa for which > /?. That is, 

•^Fi (/;/"B,crB,7B) = Pia < < f; hb,o-b,1b) + P{-oo < < -1;^b,o-b,^b) 
= P < i?i < h-\f);^B^ + F (/3 < < h-\-l);jB) 



J^B-^ {h ^(/);7b) --^s^, ( — ;7B ) {oo;jb) - ^b-^ {P;1b) , 

7-B 



and since — 2/7^ < F^ < 00, we have Fsj^ {—"^/ib'^Ib) = and F^j^ (00; 7^) = 1, so that 

{f;fj.B,crB,jB) = l + {h^^{f);iB) -^b-^ (/3;7s)- 
Thus we have that the cumulative distribution function of Fa is given by 



if;f^B,crB,lB) = 



^Bi,{h-Hf);jB) -TB^^iP;jB) -oo</<-l 
1 +Fb^ {h-^{fy,lB) -Tb'^ (/?;7b) a < / < 00 



for jB > 0, as required. 

Next we show that this is a valid CDF by showing that the following conditions are satisfied: 

(i) The limit of Ff^(/) as / tends to —00 from the right is 0, and as / tends to 00 from the 
left is 1. That is 

lim [Ff^ 0-5,75)] = and lim [Fp^ (/; pis, 0-5, 75)] = 1. 

These follow because 

\im [TF^{f;tiB,(JB,lB)] = Inn [Fb^ (fe; 75)] - Fs- (/?; 7b) 

= -^b|,(/3;7b) - -7^b|,(/3;7b) 
= 0, 
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and 

lim [J'fa (/;MB,crB,7B)] = 1+ lim [-Fs?, 7b)] - -^si (/?; 7s) 

= 1. 

(ii) -Fp^if) is a monotone, non-decreasing function of /. That is, for /i < /2, J^F^ifi) < 

-Ff^(/2). 

For — oo < /i < /2 < —1, we have that 

(/i;Ms,crB,7B) = J^B'X (/^"^ (/i); 7b) --T^sj^ (/3;7b) 

J^F^ (/2;MB,CrB,7B) = -T^B^ (/i"\/2);7b) - -^^B^ (^;7b), 

so that J^F^ [h]t^B,cFBnB) - J^F^ (/2; fJ-B^o-BHE) is equal to 

-Fbj, (/i^^(/i);7B) -^b^ {h~'if2y,iB) . 

This is negative since h^^{fi) < h^^if2) and •^b^(^;7b) is a non-decreasing, monotone 
function of b (as it is a vahd CDF). Hence J-p^Ui) < J^F^{f2), as required. 

For a < /i < /2 < oo, we have that 



^F^{.fi;i-i.B,(7B,'lB) = 1 + -7^B|^ (/i ^(/i);7b) --T^Bj^ (/3;7s) 

J^F^{f2;fJ.B,crB,jB) = 1 + -T^BJ^ (/i"\/2);7B) - -T^BJ^ (/3;7b) 



SO that J'fa (/i;mb,o-s,7b) - {f2; fJ.B,(TB,^B) is equal to 

-^B^ (/i-'(/i);7b) --Fb^ (/i-^(/2);7B) . 

As before, this is negative since h~^{fi) < h'^{f2) and J^B^^{b;jB) is non-decreasing and 
monotone. Hence J-F^ifi) < -Ffa(/2)i as required. 

Finally, let /i = and ^2 — 01. Then 

•Ffa (/i;/^b,ctb,7b) = 1 - -Fsj^ (^;7b) 

-T^Fi (/2;MB,cri3,7B) = 1 + J'sj^ 7B^ - -T^BJ, (/?; 7b) , 

and since ^sj^ {~2/^b'i 1b) = 0, we have that J^Fa (/i) ^ -^Fa (/2) holds at the discontinuity, 
(iii) Tf^ if) is continuous from the right, that is 

lim [J"fa (/ + e; MB, 0-5,75)] = Tp^ {f;fJ.B,crB,lB) ■ 
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For —oo < / < — 1 wc have that 



^Hm [J^B^^ {h-\f + e); 7b)] - Tb^^ (/3; 7s) 



and for a < / < 00 wc have that 



lim [J'fa (/ + e; Mb, crs, 7b)] 



1+ hm [J-B^ (/i~^(/ + e);7B)] --Fbj^(/3;7b) 
{h-\f);jB) - J'b^^ (/3;7b) 



-^Fi {f;fJ'B,<^B,jB) , 



as required. 

Thus J-Fa (/; Mb, fs, 7_b) is a vahd CDF for jb > 0. 
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